Skip to content

Conversation

@ngxson
Copy link
Owner

@ngxson ngxson commented May 26, 2025

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

  • New Features

    • Added support for multimodal models with vision and audio encoders, including Qwen2.5-Omni models.
    • Introduced separate handling and processing for vision and audio hyperparameters.
    • Updated tensor processing to accommodate multimodal tensor names and prefixes.
    • Enabled generation of extra positional embeddings for audio towers.
    • Extended initialization to support separate vision and audio contexts.
    • Added example commands demonstrating multimodal input usage in documentation.
  • Improvements

    • Modularized vision and audio support for cleaner, modality-agnostic code.
    • Unified tensor mapping and tensor modification logic.
    • Improved tokenization and boundary marking for mixed media inputs.
    • Enhanced test scripts to handle vision and audio inputs distinctly.
  • Bug Fixes

    • Fixed error handling for missing modalities during encoding.
    • Corrected prompt construction to accurately reflect multiple images or audio clips.

@coderabbitai
Copy link

coderabbitai bot commented May 26, 2025

Walkthrough

The changes introduce support for multimodal models with separate vision and audio encoders, refactoring code to handle distinct hyperparameters, tensor mappings, and model classes. They include updates to Python scripts, C++ headers, and tests to enable and test combined vision-audio capabilities, especially for Qwen2.5-Omni models.

Changes

File(s) Change Summary
convert_hf_to_gguf.py Modularized vision/audio hyperparameters, enabled multimodal support, added Qwen2.5-Omni model/tensor handling.
gguf-py/gguf/constants.py Added QWEN25O constant to VisionProjectorType for Qwen2.5-Omni.
gguf-py/gguf/tensor_mapping.py Extended tensor mappings for audio tower, added comments on pseudo-prefixes for audio tensors.
tools/mtmd/clip-impl.h Added PROJECTOR_TYPE_QWEN25O to projector enum and names map.
tools/mtmd/clip.cpp Unified vision/audio model structures, separated context loading, updated accessors and logic for modalities.
tools/mtmd/clip.h Introduced clip_modality enum, split context params, changed clip_init to return both vision/audio contexts.
tools/mtmd/mtmd-helper.cpp Added 1D M-RoPE positional encoding for audio tokens, updated chunk handling logic for audio/image distinction.
tools/mtmd/mtmd.cpp Split context into vision/audio, refactored tokenization into a class, updated encoding and support checks.
tools/mtmd/mtmd-cli.cpp Modified prompt construction to append multimodal markers per image, not just once.
tools/mtmd/tests.sh Refactored test setup to use separate functions for vision/audio, simplified binary handling, updated input files.
docs/multimodal.md Added examples for using multimodal Qwen2.5 Omni models with audio and vision inputs.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant mtmd_context
    participant clip_ctx_vision
    participant clip_ctx_audio

    User->>CLI: Provide prompt, images, audio
    CLI->>mtmd_context: Initialize
    mtmd_context->>clip_ctx_vision: Load vision model (if present)
    mtmd_context->>clip_ctx_audio: Load audio model (if present)
    CLI->>mtmd_context: Tokenize input (text, images, audio)
    mtmd_context->>clip_ctx_vision: Encode images (if any)
    mtmd_context->>clip_ctx_audio: Encode audio (if any)
    mtmd_context-->>CLI: Return tokenized/encoded input
    CLI-->>User: Output result
Loading
sequenceDiagram
    participant convert_hf_to_gguf
    participant MmprojModel
    participant Qwen25OmniModel

    convert_hf_to_gguf->>MmprojModel: Load model config
    MmprojModel->>MmprojModel: Extract vision/audio configs
    MmprojModel->>Qwen25OmniModel: If Qwen2.5-Omni, instantiate
    Qwen25OmniModel->>Qwen25OmniModel: Set GGUF params for vision/audio
    Qwen25OmniModel->>Qwen25OmniModel: Modify tensors for multimodal
    Qwen25OmniModel-->>convert_hf_to_gguf: Return processed model
Loading

Poem

🐇
A hop, a leap, now sights and sounds—
Vision and audio, together abound!
Models united, code refactored anew,
Markers for images, for audio too.
Tests now run with files of each kind,
Multimodal magic—what a find!
Hooray for the code, so clever and neat,
With every modality, we’re tough to beat!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94d893d and 27a8f26.

📒 Files selected for processing (3)
  • tools/mtmd/clip.cpp (45 hunks)
  • tools/mtmd/clip.h (1 hunks)
  • tools/mtmd/mtmd-helper.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tools/mtmd/clip.h
  • tools/mtmd/mtmd-helper.cpp
🧰 Additional context used
🧠 Learnings (1)
tools/mtmd/clip.cpp (1)

undefined

<retrieved_learning>
Learnt from: ngxson
PR: #25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
</retrieved_learning>

🧬 Code Graph Analysis (1)
tools/mtmd/clip.cpp (4)
tools/mtmd/mtmd.cpp (2)
  • GGML_ASSERT (190-273)
  • GGML_ASSERT (275-293)
tools/mtmd/clip-impl.h (4)
  • clip_projector_type_from_string (155-162)
  • clip_projector_type_from_string (155-155)
  • string_format (295-308)
  • string_format (295-295)
common/common.cpp (2)
  • string_format (386-399)
  • string_format (386-386)
tools/mtmd/clip.h (1)
  • clip_n_output_tokens (52-52)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
  • GitHub Check: windows-latest-cmake (llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-win...
  • GitHub Check: windows-latest-cmake (cpu-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-...
  • GitHub Check: windows-latest-cmake (vulkan-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=c...
  • GitHub Check: windows-latest-cmake (openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-win...
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: windows-latest-cmake-sycl
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: pyright type-check
  • GitHub Check: check-requirements
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
🔇 Additional comments (1)
tools/mtmd/clip.cpp (1)

244-421: LGTM: Well-implemented structural refactoring

The unification of vision and audio modalities under the clip_model struct is well-designed. The addition of the proj_type() accessor method maintains backward compatibility while allowing internal structure changes. The consistent usage of ctx->model throughout the codebase shows careful refactoring.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@ngxson
Copy link
Owner Author

ngxson commented May 26, 2025

@coderabbitai time to nits picking. first step, see if I make any typo error (i.e. broken english)

@coderabbitai

This comment was marked as resolved.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)

86-87: ⚠️ Potential issue

Fix inconsistent function usage in huge tests

Lines 86-87 still use the old add_test function instead of the new add_test_vision function. This will cause errors since add_test is no longer defined.

Apply this fix:

-    add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
-    add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
+    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
+    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
🧹 Nitpick comments (7)
tools/mtmd/mtmd-cli.cpp (1)

287-289: Consider improving marker logic for edge cases.

The current logic adds one marker per image only if the prompt contains NO markers. This could lead to inconsistent behavior:

  • Scenario 1: 3 images, 0 markers → adds 3 markers ✓
  • Scenario 2: 3 images, 1 marker → adds 0 markers ⚠️
  • Scenario 3: 3 images, 5 markers → adds 0 markers ⚠️

Consider a more robust approach:

-        if (params.prompt.find(mtmd_default_marker()) == std::string::npos) {
-            for (size_t i = 0; i < params.image.size(); i++) {
-                params.prompt += mtmd_default_marker();
-            }
-        }
+        // Count existing markers in the prompt
+        size_t marker_count = 0;
+        size_t pos = 0;
+        const std::string marker = mtmd_default_marker();
+        while ((pos = params.prompt.find(marker, pos)) != std::string::npos) {
+            marker_count++;
+            pos += marker.length();
+        }
+        
+        // Add markers for any missing images
+        for (size_t i = marker_count; i < params.image.size(); i++) {
+            params.prompt += marker;
+        }

Alternatively, document the current behavior clearly for users to understand marker placement expectations.

tools/mtmd/mtmd.cpp (3)

302-305: Null-check before freeing contexts

clip_free(nullptr) is undefined unless the helper guards internally. A trivial guard improves robustness and symmetry with other checks in the file.

~mtmd_context() {
-    clip_free(ctx_a);
-    clip_free(ctx_v);
+    if (ctx_a) clip_free(ctx_a);
+    if (ctx_v) clip_free(ctx_v);
 }

280-287: get_clip_ctx lacks TEXT branch -> unreachable abort

When the caller passes a text chunk accidentally, the function GGML_ABORTs, crashing the process instead of signalling a user-level error.

clip_ctx * get_clip_ctx(const mtmd_input_chunk * chunk) const {
    if (chunk->type == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return ctx_v;
    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_AUDIO) {
        return ctx_a;
+    } else if (chunk->type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
+        return nullptr;          // text chunks have no clip ctx
    }
    GGML_ABORT("unknown chunk type");
}

Failing fast is good, but crashing on an innocuous misuse hurts API ergonomics. Returning nullptr (caller can then error-out) is safer.


654-672: split_text can mis-count markers, causing misleading log output

The helper returns the delimiter itself as an element; however the “expected markers” reported elsewhere use parts.size() - 1, which doubles the actual count in ordinary strings ("a <m> b" → 3 parts). Consider counting the delimiter occurrences directly or returning only text fragments to avoid confusion.

This is purely logging, so not critical, but the message hampers debugging.

tools/mtmd/clip.cpp (1)

2589-2604: Warm-up allocation for audio encoded models can explode memory usage

alloc_compute_meta() builds a fake batch whose width is initialised as

img->nx = hparams.warmup_audio_size;   // default 3000
img->ny = hparams.n_mel_bins;

For speech models using 16 kHz audio this corresponds to 187.5 ms per frame, yet many users feed noticeably shorter snippets during inference.
Allocating the graph for 3 000 frames (~46 s) overly inflates both compute and weight buffers (especially on GPU back-ends with unified memory budgeting).

Consider tightening the default (3000400 ?) or – better – scaling it dynamically from ctx_params.max_audio_len / a CLI flag.
It will noticeably reduce start-up latency and VRAM footprint for typical use-cases.

convert_hf_to_gguf.py (2)

1127-1128: Add documentation for the newly added class attributes.

Consider adding docstrings to clarify the purpose of n_block_keys - it's not immediately clear that this list contains the possible key names for block count across different model configurations.

 class MmprojModel(ModelBase):
     model_type = ModelType.MMPROJ
     model_arch = gguf.MODEL_ARCH.MMPROJ
     preprocessor_config: dict[str, Any]
     global_config: dict[str, Any]
 
+    # List of possible key names for block count in different model configurations
     n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth"]

1132-1135: Consider extracting the separate vision and audio hparams initialization into dedicated methods.

The initialization logic for hparams_vision and hparams_audio could be extracted into separate methods for better maintainability and testability.

+    def _init_modality_hparams(self) -> None:
+        """Initialize vision and audio hyperparameters from the global config."""
+        self.hparams_vision = self.get_vision_config()
+        self.hparams_audio = self.get_audio_config()
+
 class MmprojModel(ModelBase):
     # ... existing code ...
     
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         
         # ... existing initialization code ...
         
         # move vision config to the top level, while preserving the original hparams in global_config
         import copy
         self.global_config = copy.deepcopy(self.hparams)
-        self.hparams_vision = self.get_vision_config()
-        self.hparams_audio = self.get_audio_config()
+        self._init_modality_hparams()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fef693d and 1ac73f4.

⛔ Files ignored due to path filters (1)
  • tools/mtmd/test-2.mp3 is excluded by !**/*.mp3
📒 Files selected for processing (10)
  • convert_hf_to_gguf.py (8 hunks)
  • gguf-py/gguf/constants.py (1 hunks)
  • gguf-py/gguf/tensor_mapping.py (2 hunks)
  • tools/mtmd/clip-impl.h (2 hunks)
  • tools/mtmd/clip.cpp (45 hunks)
  • tools/mtmd/clip.h (1 hunks)
  • tools/mtmd/mtmd-cli.cpp (1 hunks)
  • tools/mtmd/mtmd-helper.cpp (3 hunks)
  • tools/mtmd/mtmd.cpp (13 hunks)
  • tools/mtmd/tests.sh (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
tools/mtmd/mtmd-cli.cpp (1)
tools/mtmd/mtmd.cpp (2)
  • mtmd_default_marker (82-84)
  • mtmd_default_marker (82-82)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (355-551)
tools/mtmd/mtmd-helper.cpp (3)
tools/mtmd/mtmd.cpp (13)
  • nx (30-30)
  • nx (34-42)
  • n_tokens (51-57)
  • GGML_ASSERT (174-262)
  • GGML_ASSERT (264-277)
  • mtmd_input_chunk_get_tokens_image (943-948)
  • mtmd_input_chunk_get_tokens_image (943-943)
  • chunk (280-287)
  • chunk (280-280)
  • mtmd_image_tokens_get_nx (1015-1017)
  • mtmd_image_tokens_get_nx (1015-1015)
  • mtmd_image_tokens_get_ny (1019-1021)
  • mtmd_image_tokens_get_ny (1019-1019)
tools/mtmd/mtmd.h (6)
  • nx (120-120)
  • chunk (148-148)
  • chunk (149-149)
  • chunk (159-159)
  • mtmd_image_tokens_get_nx (168-168)
  • mtmd_image_tokens_get_ny (169-169)
tools/server/utils.hpp (7)
  • GGML_ASSERT (1156-1159)
  • pos (1110-1117)
  • pos (1110-1110)
  • pos (1162-1165)
  • pos (1162-1162)
  • chunk (1127-1147)
  • chunk (1127-1127)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
  • GitHub Check: build-linux-cross / ubuntu-24-arm64-vulkan-cross
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
  • GitHub Check: build-linux-cross / ubuntu-24-ppc64el-cpu-cross
  • GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: macOS-latest-swift (generic/platform=macOS)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: macOS-latest-cmake-tvos
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: check-requirements
  • GitHub Check: server-windows
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
🔇 Additional comments (27)
gguf-py/gguf/constants.py (1)

2263-2263: LGTM! Consistent constant addition for Qwen2.5-Omni support.

The new QWEN25O constant follows the existing naming pattern and is properly aligned with the corresponding C++ enum mapping in tools/mtmd/clip-impl.h. The comment clearly indicates this is for omni-modal support.

tools/mtmd/clip-impl.h (1)

133-133: LGTM! Proper enum and mapping addition for Qwen2.5-Omni projector type.

The enum value PROJECTOR_TYPE_QWEN25O is correctly positioned before PROJECTOR_TYPE_UNKNOWN, and the corresponding string mapping to "qwen2.5o" is consistent with the Python constant in gguf-py/gguf/constants.py. The formatting follows the existing code style.

Also applies to: 152-152

gguf-py/gguf/tensor_mapping.py (3)

1128-1128: LGTM: Audio tensor mapping addition

The addition of "audio_tower.ln_post" to the A_POST_NORM tensor mapping correctly extends support for Qwen2.5-Omni audio encoders.


1165-1166: Good documentation of pseudo-prefix convention

The comment clearly explains the purpose of the "audio." pseudo-prefix for preventing conflicts with vision tensors, which helps with code maintainability.


1174-1174: LGTM: Audio projection tensor mapping

The addition of "audio_tower.proj" to the A_MMPROJ_FC tensor mapping correctly supports the audio tower projection layer in Qwen2.5-Omni models.

tools/mtmd/clip.h (4)

20-23: LGTM: Clear modality enumeration

The clip_modality enum provides a clean way to distinguish between vision and audio processing modes.


27-27: LGTM: Modality field addition

Adding the modality field to clip_context_params allows the initialization function to determine the appropriate processing mode.


31-34: LGTM: Well-designed result structure

The clip_init_result struct with separate ctx_v and ctx_a pointers provides a clean way to return both vision and audio contexts from initialization.


36-36:

✅ Verification successful

Breaking API change - ensure all callers are updated

The function signature change from returning a single clip_ctx * to returning clip_init_result is a breaking change that will require updates to all calling code.


🏁 Script executed:

#!/bin/bash
# Search for all usages of clip_init to verify they've been updated
rg -A 5 -B 5 "clip_init\s*\(" --type cpp --type c

Length of output: 1816


No further action needed – all clip_init callers have been updated to use clip_init_result.

  • Verified in tools/mtmd/mtmd.cpp that the call to clip_init is unpacked via auto res = clip_init(...), with res.ctx_v and res.ctx_a being used.
tools/mtmd/mtmd-helper.cpp (3)

69-70: Good clarification of function purpose

Renaming to set_position_mrope_2d and adding the "for image" comment clearly indicates this function is specifically for 2D spatial positioning used by image tokens.


89-104: LGTM: Audio M-RoPE implementation

The 1D M-RoPE implementation for audio tokens correctly sets sequential positions for all three used dimensions while leaving the fourth dimension unused, consistent with the 2D version.


167-180: Robust chunk type handling for M-RoPE

The logic correctly distinguishes between image and audio chunks, calling the appropriate M-RoPE positioning function for each type, with proper error handling for unsupported chunk types.

tools/mtmd/tests.sh (4)

35-35: Good addition of file array

The arr_file array properly supports different input file types for vision and audio tests.


37-50: Well-designed specialized test functions

The add_test_vision and add_test_audio functions properly handle the different requirements for each modality (vision uses chat templates, audio doesn't; different input files).


101-101: LGTM: Simplified build target

Building only the unified llama-mtmd-cli binary aligns with the architectural consolidation.


105-118: LGTM: Updated test execution logic

The test loop correctly uses the new array structure and applies the appropriate input file for each test case.

tools/mtmd/clip.cpp (1)

2099-2107:

❓ Verification inconclusive

Possible prefix mismatch between hyper-parameter keys and tensor names

load_hparams() uses

const char * prefix = is_vision ? "vision" : "audio";

when querying numeric keys such as "vision_n_embd" / "audio_n_embd" (and similar).

Later on, load_tensors() derives a different prefix for the very same modality:

const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";

All subsequent tensor look-ups are built with string_format(..., prefix, …) and therefore expect names like "v_attn_q.weight" or "a_ln_1.bias".

If your GGUF conversion script exported:

  • hyper-parameter keys with "vision_" / "audio_" and
  • tensor names with "v_" / "a_"

then everything is fine.
However, if the exporter used the long form ("vision_" / "audio_") for both metadata and tensors (or vice-versa), the loader will silently fail to locate mandatory tensors, returning nullptr which will later cause hard-to-trace SIGSEGVs.

I recommend unifying the prefix logic in one place, e.g.

-const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";
+const char * prefix = model.modality == CLIP_MODALITY_AUDIO ? "audio" : "vision";

(or the opposite, depending on your GGUF convention) and providing an assertion after get_tensor() that no required tensor was missed.

Would you double-check the actual tensor names inside a representative GGUF file and confirm which style is used?
If you need a helper script to dump the tensor list I can provide one.

Also applies to: 2294-2300


🏁 Script executed:

#!/bin/bash
# Show the load_tensors implementation and look for prefix assignment
rg -n "load_tensors" -A20 tools/mtmd/clip.cpp

# Find all prefix definitions in clip.cpp to compare both usages
rg -n "const char \\* prefix" -n tools/mtmd/clip.cpp

Length of output: 2301


Ensure consistent prefix usage for hparams and tensors

There’s a mismatch between how you load hyper-parameters vs. tensors:

  • In load_hparams (around line 2104):
    const char *prefix = is_vision ? "vision" : "audio";
  • In load_tensors (line 2296):
    const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";

If your GGUF export uses the long form ("vision_"/"audio_") for both metadata and tensors (or the short form for both), the loader will silently fail to find tensors and crash later.

I recommend:

• Unify the prefixes (choose long or short) in one place, e.g. in load_tensors:

- const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "a" : "v";
+ const char *prefix = model.modality == CLIP_MODALITY_AUDIO ? "audio" : "vision";

• After each get_tensor():

auto *t = get_tensor(...);
assert(t && "missing tensor: " KEY(prefix) "_" name);

Could you please confirm which convention your GGUF file uses for tensor names? For example, you can run:

strings model.gguf | grep -E '"?(vision|audio|v|a)_' | sort -u

and share the output so we can lock in the correct prefix.

convert_hf_to_gguf.py (10)

435-438: LGTM! Clean configuration renaming for Qwen2.5-Omni models.

The code correctly handles the configuration structure change by renaming thinker_config to text_config for Qwen2.5-Omni models.


1172-1177: Excellent modular design for multimodal configuration handling.

The introduction of get_vision_config() and get_audio_config() methods provides a clean abstraction for accessing modality-specific configurations. This design allows for easy extension to support models with different configuration structures.


1216-1231: Well-designed parameter lookup methods with proper error handling.

The find_vparam, find_aparam, and _find_param methods provide a clean and consistent API for accessing modality-specific parameters. The assertion checks ensure that the respective configs exist before attempting to access them.


2709-2714: LGTM! Proper registration of Qwen2.5-Omni models.

The addition of "Qwen2_5OmniModel" to the model registration follows the established pattern.


2732-2738: Comprehensive tensor name processing for multimodal components.

The enhanced tensor name processing correctly handles the stripping of "thinker." prefixes and properly skips multimodal tensors (visual, audio, talker, token2wav). This ensures that only relevant tensors are processed.


2759-2777: Robust projector type handling for different model versions.

The code properly differentiates between different Qwen2 vision model types and sets the appropriate projector types. The handling of fullatt_block_indexes for window attention patterns is well-implemented with proper validation.


2828-2834: LGTM! Proper audio hyperparameter initialization.

The audio configuration is correctly extracted and mapped to the expected parameter names for consistency with the rest of the codebase.


2835-2840: Clean audio-specific GGUF parameter setup.

The audio-specific parameters are properly added to the GGUF writer with sensible defaults for the layer norm epsilon.


2847-2858: Well-implemented sinusoidal position embeddings for audio.

The implementation follows the standard sinusoidal position encoding formula with appropriate scaling. The fixed length of 1500 and max_timescale of 10000 appear to be model-specific constants.


2869-2878: Correct handling of audio tensor modifications.

The code properly handles:

  1. Unsqueezing bias tensors for conv layers to match expected dimensions
  2. Skipping the unused audio_bos_eos_token as documented in the transformers code reference

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)

89-91: ⚠️ Potential issue

add_test is undefined – the huge branch will fail

After the refactor only add_test_vision / add_test_audio exist, but the huge section still calls the removed add_test helper, leading to “command not found” and an early script abort.

-    add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
-    add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
+    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
+    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
♻️ Duplicate comments (2)
tools/mtmd/mtmd.cpp (1)

730-732: 🛠️ Refactor suggestion

Prefer querying the projector for its embedding size

Using ctx->n_embd_text assumes text and audio dimensions are forever equal.
Future adapters (e.g. 1×1 linear) could break this assumption.

-        int n_mmproj_embd = ctx->n_embd_text;
+        int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);

This keeps the audio path self-contained and resilient.

tools/mtmd/clip.cpp (1)

4044-4085: ⚠️ Potential issue

Critical: Null-pointer dereference risk in clip_n_mmproj_embd()

Several projector branches blindly read tensor shapes without checking the pointer, e.g.

case PROJECTOR_TYPE_LDP:
    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];

If a future GGUF is produced without optional bias terms (quite common for weight-only formats to save space) mm_model_block_1_block_2_1_b will be nullptr and the call will seg-fault the encoder during the very first size query.

Please guard every dereference with a nullptr check (or fall back to another tensor that is guaranteed to exist), e.g.

-case PROJECTOR_TYPE_LDP:
-    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];
+case PROJECTOR_TYPE_LDP: {
+    GGML_ASSERT(ctx->model.mm_model_block_1_block_2_1_b &&
+                "Missing bias tensor for LDP projector");
+    return ctx->model.mm_model_block_1_block_2_1_b->ne[0];
+}

The same applies to the LDPV2, MLP, PIXTRAL, MINICPMV and several other cases below.

🧹 Nitpick comments (1)
docs/multimodal.md (1)

101-109: Clarify placeholder & flags for new mixed-modality examples

The newly-added commands still include the literal (tool_name) placeholder. Readers may miss the earlier note that this must be replaced with llama-mtmd-cli or llama-server.

Consider either:

  1. Replacing the placeholder with a concrete example (as done in lines 45-63), or
  2. Adding an inline reminder, e.g.
(tool_name)  # replace with llama-mtmd-cli or llama-server

Also mention whether an external --mmproj is required when the model is loaded via -m <local-gguf> (unlike the -hf path).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac73f4 and bf34f38.

📒 Files selected for processing (4)
  • docs/multimodal.md (1 hunks)
  • tools/mtmd/clip.cpp (45 hunks)
  • tools/mtmd/mtmd.cpp (13 hunks)
  • tools/mtmd/tests.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: build-linux-cross / ubuntu-24-arm64-vulkan-cross
  • GitHub Check: android-build
  • GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
  • GitHub Check: macOS-latest-swift (generic/platform=macOS)
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ios-xcode-build
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: server-windows
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server (Release)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
🔇 Additional comments (9)
tools/mtmd/tests.sh (1)

108-123:

❌ Incorrect review comment

Audio files are passed with --image; verify correct CLI flag

--image $SCRIPT_DIR/$inp_file is used for every modality, but add_test_audio
pushes an MP3 file. Unless llama-mtmd-cli treats --image as a generic
media flag, this may silently skip audio processing or raise a MIME error.

Suggested defensive tweak:

flag=$( [[ $inp_file == *.mp3 ]] && echo "--audio" || echo "--image" )
"$PROJ_ROOT/build/bin/$bin" \
    -hf "$hf" \
    $flag "$SCRIPT_DIR/$inp_file" \
    …

Please confirm the intended CLI syntax.


🏁 Script executed:

#!/bin/bash
# Search for audio and image test registration and arr_file usage in tests.sh
rg --with-filename --no-heading --color never "add_test_audio" -n tools/mtmd/tests.sh
rg --with-filename --no-heading --color never "add_test_image" -n tools/mtmd/tests.sh
rg --with-filename --no-heading --color never "arr_file" -n tools/mtmd/tests.sh

Length of output: 653


🏁 Script executed:

#!/bin/bash
# Search for defined CLI flags for audio and image in the codebase
rg --with-filename --no-heading --color never "\-\-(audio|image)" -n .

# Look for argument parser definitions (Python argparse or Rust Clap)
rg --with-filename --no-heading --color never "add_argument" -n .
rg --with-filename --no-heading --color never "#\[derive(Parser)" -n .
rg --with-filename --no-heading --color never "clap::Parser" -n .

# Check README for any usage examples of --audio or --image
if [ -f README.md ]; then
  grep -nE '\-\-(audio|image)' README.md || echo "No audio/image flags in README"
fi

Length of output: 28697


🏁 Script executed:

#!/usr/bin/env bash
# Show context around the media FILE argument definition
rg -n -C5 '\{"--image", "--audio"\}' common/arg.cpp

Length of output: 618


CLI flags for image/audio are interchangeable
The argument parser in common/arg.cpp registers both --image and --audio as synonyms for the same FILE option (all values are stored in params.image), so passing an MP3 with --image will still be processed as audio. No changes are required—feel free to use --audio instead of --image for clarity.

Likely an incorrect or invalid review comment.

tools/mtmd/mtmd.cpp (1)

162-171: Great! Dimension-mismatch guard added

Validating that vision and audio projectors share the same n_embd prevents the
buffer over/under-run reported earlier. 👍

tools/mtmd/clip.cpp (7)

178-207: LGTM! Good hyperparameter structure extension for multimodal support.

The addition of image_mean, image_std arrays, warmup_audio_size, and legacy compatibility fields properly supports the transition to unified vision/audio processing.


244-247: Excellent refactoring to unified model structure.

The addition of modality and proj_type fields to clip_model enables clean support for both vision and audio modalities under a single structure, improving code maintainability.


417-422: Good backward compatibility design with proj_type() accessor.

The accessor function maintains API compatibility while enabling the internal refactoring to proceed cleanly.


2683-2717: Well-structured multimodal initialization.

The refactoring to return separate vision and audio contexts with proper error handling and cleanup is well implemented. The use of clip_model_loader to handle both modalities cleanly maintains the separation of concerns.


2067-2097: Robust modality-specific loading logic.

The sanity checks for modality support and special handling for PROJECTOR_TYPE_QWEN25O multimodal models demonstrate good understanding of the different model architectures. The logic correctly maps the generic multimodal type to specific vision/audio projector types.


2289-2589: Clean tensor loading with modality-aware prefixing.

The use of model.modality to determine tensor prefixes ("v" for vision, "a" for audio) enables clean reuse of the tensor loading logic for both modalities while maintaining proper namespace separation.


4087-4123: Excellent API compatibility maintenance.

The updated detection functions (clip_has_vision_encoder, clip_has_audio_encoder, etc.) properly use the new modality field and proj_type() accessor while maintaining backward compatibility with existing API consumers.

Comment on lines +275 to +293
void init_audio() {
GGML_ASSERT(ctx_a != nullptr);
projector_type proj = clip_get_projector_type(ctx_a);

if (clip_has_whisper_encoder(ctx_a)) {
// TODO @ngxson : check if model n_mel is 128 or 80
w_filters = whisper_precalc_filters::get_128_bins();
}

// warning messages
if (proj == PROJECTOR_TYPE_LLAMA4) {
LOG_WRN("%s: llama 4 vision is known to have degraded quality:\n"
" https://github.com/ggml-org/llama.cpp/pull/13282\n", __func__);
LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
" https://github.com/ggml-org/llama.cpp/discussions/13759\n", __func__);

if (proj == PROJECTOR_TYPE_QWEN2A) {
// <|audio_bos|> ... (embeddings) ... <|audio_eos|>
aud_beg = "<|audio_bos|>";
aud_end = "<|audio_eos|>";

}
if (has_audio) {
LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
" https://github.com/ggml-org/llama.cpp/discussions/13759\n", __func__);
}
Copy link

@coderabbitai coderabbitai bot May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing BOS/EOS tokens for Qwen-2.5 Omni audio

init_audio() sets aud_beg/aud_end only for PROJECTOR_TYPE_QWEN2A.
The new projector type PROJECTOR_TYPE_QWEN25O (added elsewhere) is ignored,
so audio chunks from Qwen2.5 Omni will be injected without boundary markers,
likely breaking generation.

 } else if (proj == PROJECTOR_TYPE_QWEN25O) {
     aud_beg = "<|audio_bos|>";
     aud_end = "<|audio_eos|>";
 }

(Adjust token strings to match the HF tokenizer if they differ.)

🤖 Prompt for AI Agents
In tools/mtmd/mtmd.cpp between lines 275 and 293, the init_audio() function only
sets the aud_beg and aud_end tokens for PROJECTOR_TYPE_QWEN2A, ignoring the new
PROJECTOR_TYPE_QWEN25O. This causes audio chunks from Qwen-2.5 Omni to lack
boundary markers, which may break generation. Add a condition to set aud_beg and
aud_end tokens for PROJECTOR_TYPE_QWEN25O, matching the tokens used for Qwen2.5
Omni, and ensure the tokens align with the HF tokenizer if they differ.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PROJECTOR_TYPE_QWEN25O, // will be replaced by QWEN2A or QWEN25VL depending on clip_ctx

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for where PROJECTOR_TYPE_QWEN25O gets replaced or mapped
rg "QWEN25O" -A 5 -B 5

# Look for any mapping or replacement logic
rg "QWEN2A|QWEN25VL" -A 3 -B 3 | grep -E "(replace|map|convert|assign)"

# Search for projector type assignment/resolution logic
ast-grep --pattern $'if ($$$) {
  $$$
  proj_type = $$$
  $$$
}'

Length of output: 4939


@ngxson Thank you for the clarification! You're absolutely right. I found the replacement logic in tools/mtmd/clip.cpp:

// correct arch for multimodal models
if (model.proj_type == PROJECTOR_TYPE_QWEN25O) {
    model.proj_type = modality == CLIP_MODALITY_VISION
                        ? PROJECTOR_TYPE_QWEN25VL
                        : PROJECTOR_TYPE_QWEN2A;
}

This means that by the time init_audio() is called for a Qwen2.5 Omni audio context, the projector type has already been resolved to PROJECTOR_TYPE_QWEN2A, which is correctly handled by the existing conditional block that sets aud_beg = "<|audio_bos|>" and aud_end = "<|audio_eos|>".

My original comment was a false positive. The code is working correctly as designed. Thanks for pointing this out!


🧬 Code Graph Analysis Results

<no_relevant_code_snippets>


✏️ Learnings added
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 26, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tools/mtmd/tests.sh (2)

72-74: Audio test still passed via --image; consider --audio for clarity

Although the CLI currently accepts an audio file on --image, this is counter-intuitive and may confuse contributors. Prefer adding a dedicated --audio flag (or reuse --media) and updating the test invocation accordingly.


107-118: Hard-coded binary name but flexible arrays – future-proofing idea

bin="llama-mtmd-cli" is hard-coded even though the legacy helper allowed passing a different binary.
If the intention is to test multiple binaries again, consider restoring the argument in the function signature and storing it in an arr_bin array.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf34f38 and d03c240.

📒 Files selected for processing (3)
  • tools/mtmd/clip-impl.h (2 hunks)
  • tools/mtmd/mtmd.cpp (13 hunks)
  • tools/mtmd/tests.sh (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
tools/mtmd/clip-impl.h (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
tools/mtmd/mtmd.cpp (1)
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: windows-latest-cmake (vulkan-x64, -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=c...
  • GitHub Check: windows-latest-cmake (llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-win...
  • GitHub Check: windows-latest-cmake (cpu-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-...
  • GitHub Check: windows-latest-cmake (openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-win...
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: pyright type-check
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: server-windows
  • GitHub Check: check-requirements
🔇 Additional comments (4)
tools/mtmd/clip-impl.h (1)

133-135: 🛠️ Refactor suggestion

⚠️ Potential issue

Adding PROJECTOR_TYPE_QWEN25O in-between existing enum values silently shifts every following numeric constant – this can break on-disk/ABI compatibility.

Many structures (e.g. GGUF metadata) serialise the raw integer of projector_type; inserting a new value in the middle changes those integers for every subsequent type (UNKNOWN used to be N, now it is N+1).
Unless the enum is guaranteed to be used only in memory, append new variants at the end or assign explicit values:

 enum projector_type {
@@
     PROJECTOR_TYPE_QWEN2A,
-    PROJECTOR_TYPE_QWEN25O, // will be replaced by QWEN2A or QWEN25VL depending on clip_ctx
-    PROJECTOR_TYPE_UNKNOWN,
+    PROJECTOR_TYPE_QWEN25O, // placeholder …
+    PROJECTOR_TYPE_UNKNOWN  = 0xFFFF,   // keep stable value
 };

Alternatively, add = <fixed_id> for every enum that is part of a serialised format.

Also applies to: 152-153

⛔ Skipped due to learnings
Learnt from: ngxson
PR: ngxson/llama.cpp#25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.635Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
tools/mtmd/mtmd.cpp (3)

148-151: media_marker emptiness check prevents accidental infinite splitting – good catch

The early guard ensures the tokenizer cannot enter an infinite loop when the delimiter is empty.
Looks correct as implemented.


162-172: Helpful dimension-mismatch guard

Verifying that the vision and audio projectors agree on n_embd prevents subtle buffer overruns later on.
Nice defensive programming!


805-808: Early-exit for Gemma-3 non-causal decode – concise and correct

The modality-specific flag is computed once and re-used – looks good.

Comment on lines +753 to 757
int n_mmproj_embd = ctx->n_embd_text;
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
bool ok = clip_image_batch_encode(
ctx->ctx_clip,
ctx->ctx_a,
ctx->n_threads,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use clip_n_mmproj_embd(ctx->ctx_a) instead of n_embd_text for robustness

n_mmproj_embd is derived from the text model embedding size, assuming it was already validated to match the audio projector.
If, in a future refactor, the validation path is bypassed (or conditional compilation removes vision support), this assignment could overflow / under-allocate image_embd_v.

-    int n_mmproj_embd = ctx->n_embd_text;
+    const int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);

Keeps the function self-contained, reduces coupling, and makes the intent clearer.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int n_mmproj_embd = ctx->n_embd_text;
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
bool ok = clip_image_batch_encode(
ctx->ctx_clip,
ctx->ctx_a,
ctx->n_threads,
// allocate buffer using the actual audio projector dimension
- int n_mmproj_embd = ctx->n_embd_text;
+ const int n_mmproj_embd = clip_n_mmproj_embd(ctx->ctx_a);
ctx->image_embd_v.resize(chunk->tokens_audio->n_tokens * n_mmproj_embd);
bool ok = clip_image_batch_encode(
ctx->ctx_a,
ctx->n_threads,
🤖 Prompt for AI Agents
In tools/mtmd/mtmd.cpp around lines 753 to 757, the variable n_mmproj_embd is
assigned from ctx->n_embd_text, which may not accurately reflect the current
embedding size if the text and image models are decoupled or changed. Replace
the assignment with clip_n_mmproj_embd(ctx->ctx_a) to dynamically retrieve the
correct embedding size, ensuring the resize operation is robust and
future-proof. This change improves code clarity and reduces potential bugs
related to mismatched embedding dimensions.

Comment on lines 38 to 54
add_test_vision() {
local hf=$1
local tmpl=${2:-""} # default to empty string if not provided
arr_prefix+=("[vision]")
arr_hf+=("$hf")
arr_tmpl+=("$tmpl")
arr_file+=("test-1.jpeg")
}

add_test "llama-mtmd-cli" "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" "vicuna"
add_test "llama-mtmd-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" "vicuna"
add_test "llama-mtmd-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-mtmd-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-mtmd-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test "llama-mtmd-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"
add_test_audio() {
local hf=$1
arr_prefix+=("[audio] ")
arr_hf+=("$hf")
arr_tmpl+=("") # no need for chat tmpl
arr_file+=("test-2.mp3")
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

add_test was removed but is still referenced later – script will exit with command not found

Functions were renamed to add_test_vision / add_test_audio, yet lines 92-94 still invoke the old helper.
Running ./tests.sh huge (or big) will therefore fail before any test is executed.

-if [ "$RUN_HUGE_TESTS" = true ]; then
-    add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
-    add_test "llama-mtmd-cli" "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
+if [ "$RUN_HUGE_TESTS" = true ]; then
+    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
+    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
 fi

Also consider adding an add_test_huge() wrapper to keep naming symmetrical.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
add_test_vision() {
local hf=$1
local tmpl=${2:-""} # default to empty string if not provided
arr_prefix+=("[vision]")
arr_hf+=("$hf")
arr_tmpl+=("$tmpl")
arr_file+=("test-1.jpeg")
}
add_test "llama-mtmd-cli" "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" "vicuna"
add_test "llama-mtmd-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" "vicuna"
add_test "llama-mtmd-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-mtmd-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-mtmd-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test "llama-mtmd-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test "llama-mtmd-cli" "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test "llama-mtmd-cli" "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"
add_test_audio() {
local hf=$1
arr_prefix+=("[audio] ")
arr_hf+=("$hf")
arr_tmpl+=("") # no need for chat tmpl
arr_file+=("test-2.mp3")
}
if [ "$RUN_HUGE_TESTS" = true ]; then
add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
fi
🤖 Prompt for AI Agents
In tools/mtmd/tests.sh between lines 38 and 54, the script still references the
removed add_test function, which causes a command not found error. Update the
script to replace calls to add_test with the new function names add_test_vision
or add_test_audio as appropriate, or add a new add_test_huge wrapper for
consistency. This will prevent script failure before tests run.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
tools/mtmd/tests.sh (1)

129-133: ⚠️ Potential issue

Use correct input flag for audio vs. vision
Currently every test invocation uses --image, even for .mp3 files. Audio inputs should use the proper flag (e.g., --audio). Otherwise the CLI will error out on audio tests.

Example fix:

+    # choose flag based on file extension
+    if [[ "$inp_file" == *.mp3 ]]; then
+      input_flag="--audio"
+    else
+      input_flag="--image"
+    fi
     output=$(\
         "$PROJ_ROOT/build/bin/$bin" \
         -hf "$hf" \
-        --image $SCRIPT_DIR/$inp_file \
+        $input_flag $SCRIPT_DIR/$inp_file \
         -p "what is the publisher name of the newspaper?" \
         --temp 0 -n 128 \
         ${tmpl:+--chat-template "$tmpl"} \
♻️ Duplicate comments (1)
tools/mtmd/tests.sh (1)

96-97: add_test is still referenced for huge tests
The generic add_test helper was removed in prior refactoring. These calls will result in command not found.

🧹 Nitpick comments (1)
tools/mtmd/tests.sh (1)

47-53: Implement add_test_audio() helper
The audio helper correctly tracks model, template, and MP3 input. For consistency with the vision prefix, consider removing the trailing space in "[audio] ".

Suggested diff:

-    arr_prefix+=("[audio] ")
+    arr_prefix+=("[audio]")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d03c240 and ef48e8f.

📒 Files selected for processing (1)
  • tools/mtmd/tests.sh (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: android-build
  • GitHub Check: ubuntu-22-cmake-vulkan
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-22-cmake-sycl
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server-windows
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: check-requirements
🔇 Additional comments (7)
tools/mtmd/tests.sh (7)

33-36: Initialize arrays for test metadata
The new arr_prefix and arr_file arrays correctly complement the existing arr_hf and arr_tmpl arrays to track per-test prefixes and input files.


38-45: Implement add_test_vision() helper
This function cleanly appends vision-model entries (HF path, template, default JPEG file, and “[vision]” prefix) to the test arrays.


55-73: Populate initial vision and audio tests
Your series of add_test_vision and add_test_audio invocations comprehensively covers small to mid-sized models. Listing them explicitly aids clarity; all entries look correct.


77-89: Populate BIG-model vision and audio tests
The “big” block adds larger vision and audio tests consistently. Commenting out the known‐bad 32B model is a helpful flag for future runs.


111-111: Build the CLI binary before testing
Invoking cmake --build ... --target llama-mtmd-cli ensures the latest changes are compiled. This placement is correct.


115-121: Iterate over registered tests
Looping over arr_hf (and corresponding arr_prefix, arr_tmpl, arr_file) is a clean, data-driven approach. The inp_file assignment is correct.


137-141: Evaluate and format test results
The grep-based check for “new york” and ANSI-colored OK/FAIL strings work as expected. Consider centralizing the success pattern or making it configurable in the future.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tools/mtmd/tests.sh (2)

47-53: Normalize arr_prefix formatting in add_test_audio().

Currently you append "[audio] " (with a trailing space), whereas add_test_vision() uses "[vision]". This causes inconsistent spacing in the final log lines:

Suggestion: remove the trailing space and let the formatting layer insert spaces uniformly.

-arr_prefix+=("[audio] ")
+arr_prefix+=("[audio]")

55-73: Refactor repetitive model registrations.

There’s a long list of add_test_vision/add_test_audio calls. Consider DRYing this up by defining an array of model specs (including modality and optional template) and looping over it to call the appropriate helper. This will improve maintainability when adding or removing models.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef48e8f and 94d893d.

📒 Files selected for processing (1)
  • tools/mtmd/tests.sh (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (19)
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-vulkan-cross
  • GitHub Check: build-linux-cross / ubuntu-24-ppc64el-vulkan-cross
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-cross
  • GitHub Check: ubuntu-22-cmake-musa
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: windows-2019-cmake-cuda (12.4)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: server-windows
  • GitHub Check: server (Release)
  • GitHub Check: server (ADDRESS, RelWithDebInfo)
  • GitHub Check: server (UNDEFINED, RelWithDebInfo)
  • GitHub Check: check-requirements
🔇 Additional comments (4)
tools/mtmd/tests.sh (4)

28-28: Skip echo change.

The updated message clarifies inclusion of both BIG and HUGE models but doesn’t introduce any functional change that requires review.


33-36: Array initialization is sound.

You’ve correctly introduced arr_file alongside the existing arrays to track input filenames. The empty initializations ensure safe appends later in the script.


38-45: add_test_vision() helper is well-structured.

  • Uses local for parameters.
  • Provides a sensible default for tmpl.
  • Appends all four arrays in lockstep, including the new arr_file entry for vision inputs.

137-141: Verify output check works for both modalities.

You grep new york to decide pass/fail indiscriminately. Audio tests may produce different expected keywords or confidence markers. Please confirm that "new york" is the right heuristic for audio outputs, or parameterize the success pattern per modality.

"$PROJ_ROOT/build/bin/$bin" \
-hf "$hf" \
--image $SCRIPT_DIR/test-1.jpeg \
--image $SCRIPT_DIR/$inp_file \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Audio inputs are passed to --image flag—likely a bug.

The invocation always uses --image $SCRIPT_DIR/$inp_file, so audio tests will feed .mp3 files to the image flag. You must branch on modality (or file extension) to use the correct CLI flag, for example --audio for MP3s.

Apply a conditional here, for example:

-        --image $SCRIPT_DIR/$inp_file \
+        $( if [[ "${inp_file##*.}" == "mp3" ]]; then
+             echo "--audio $SCRIPT_DIR/$inp_file"
+           else
+             echo "--image $SCRIPT_DIR/$inp_file"
+           fi ) \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
--image $SCRIPT_DIR/$inp_file \
$( if [[ "${inp_file##*.}" == "mp3" ]]; then
echo "--audio $SCRIPT_DIR/$inp_file"
else
echo "--image $SCRIPT_DIR/$inp_file"
fi ) \
🤖 Prompt for AI Agents
In tools/mtmd/tests.sh at line 129, the script always passes audio files to the
--image flag, which is likely a bug since --image is intended for image inputs.
To fix this, add a conditional check based on the file extension or modality to
use --audio for MP3 files and --image for image files. Implement the condition
before the invocation to select the correct flag dynamically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation examples python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant